AITopics | accurate stochastic gradient estimation

Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems, Dec-25-2025, 19:52:16 GMT

Stochastic Gradient Descent (SGD) is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with a sample size of one. Several prior works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than that of calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the apparent gain of faster convergence in iterations leads, in reality, to slower convergence in time. In this paper, we break this barrier by providing the first demonstration of a scheme, Locality Sensitive Hashing (LSH) sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling. Such an algorithm is possible due to the sampling view of LSH, which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms that rely on gradient estimates, including Adam, AdaGrad, etc. We demonstrate the effectiveness of our proposal with experiments on linear models as well as the non-linear BERT, a recent and popular deep-learning-based language representation model.
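The "sampling view of LSH" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes signed random projections (SimHash) as the LSH family and a single hash table, and the names `simhash_code`, `build_table`, `collision_probability`, and `lsh_sample` are hypothetical. The point is only that, after a one-time preprocessing pass, drawing a sample costs a few dot products and one bucket lookup, independent of the number of examples, while examples more similar to the query are more likely to be drawn.

```python
import math
import random

def simhash_code(vector, hyperplanes):
    """Sign pattern of the vector against each random hyperplane."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def build_table(points, hyperplanes):
    """One-time preprocessing: bucket every point index by its hash code."""
    table = {}
    for index, point in enumerate(points):
        table.setdefault(simhash_code(point, hyperplanes), []).append(index)
    return table

def collision_probability(a, b, num_bits):
    """Probability that a and b agree on all num_bits SimHash bits."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return (1.0 - math.acos(cosine) / math.pi) ** num_bits

def lsh_sample(query, table, hyperplanes, rng):
    """Draw one index from the query's bucket; None if the bucket is empty."""
    bucket = table.get(simhash_code(query, hyperplanes), [])
    if not bucket:
        return None
    return rng.choice(bucket)

# Toy data (assumption: Gaussian vectors, purely for illustration).
rng = random.Random(0)
dim, num_bits, num_points = 5, 4, 200
hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_points)]
table = build_table(points, hyperplanes)

# Stand-in query; in an optimizer it would be derived from the current parameters.
query = points[0]
sampled = lsh_sample(query, table, hyperplanes, rng)
```

Because the sample is not uniform, an unbiased gradient estimate would have to reweight the sampled gradient by the inverse of its sampling probability, which depends on `collision_probability` and the bucket size. A practical scheme would also likely use several tables to handle empty buckets; both details are left out of this sketch.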

accurate stochastic gradient estimation, convergence, name change, (4 more...)


Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


Reviews: Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems, Jan-26-2025, 04:06:06 GMT

Summary: This paper develops a new method for adaptively sampling training examples during stochastic optimization. It is known that the optimal distribution that minimizes the nuclear norm of the covariance of the gradient estimate is one where the probability of sampling an example is proportional to the magnitude of the gradient of the loss on that example. Sampling according to this distribution is of course impractical, because computing it is as expensive as computing the full gradient and requires O(N) time per iteration. To get around this, prior work either maintains a fixed distribution across all iterations or makes strong assumptions on the distribution of gradients of different training examples (e.g., that the gradients of training examples of the same class are similar). This paper proposes a method that can adaptively sample from a different distribution at every iteration, requires few assumptions on the distribution of gradients, and yet has the same per-iteration cost as SGD.
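The optimality claim in the summary can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper or the review: it uses a toy least-squares problem with Gaussian data, and the helper names are hypothetical. It computes the exact total variance (the trace of the covariance, which for a covariance matrix equals its nuclear norm) of the single-sample importance-weighted estimator g_i / (N p_i), once for uniform sampling and once for sampling proportional to the gradient norm. Note that building the second distribution touches every example, which is the O(N) per-iteration cost the summary describes.

```python
import math
import random

def squared_loss_gradients(theta, xs, ys):
    """Gradient of (theta . x - y)^2 with respect to theta, per example."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        grads.append([2.0 * residual * xi for xi in x])
    return grads

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def gradient_norm_distribution(grads):
    """Sampling probabilities proportional to each gradient's norm (O(N))."""
    norms = [l2_norm(g) for g in grads]
    total = sum(norms)
    if total == 0.0:
        return [1.0 / len(grads)] * len(grads)
    return [nrm / total for nrm in norms]

def estimator_variance(grads, probs):
    """Trace of the covariance of g_i / (n * p_i) when i is drawn from probs."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(dim)]
    second_moment = sum(
        l2_norm(g) ** 2 / (n * n * p) for g, p in zip(grads, probs) if p > 0.0
    )
    return second_moment - l2_norm(mean) ** 2

# Toy problem (assumption: Gaussian features and targets, theta at zero).
rng = random.Random(0)
n, dim = 500, 3
xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]
theta = [0.0] * dim

grads = squared_loss_gradients(theta, xs, ys)
uniform = [1.0 / n] * n
adaptive = gradient_norm_distribution(grads)
var_uniform = estimator_variance(grads, uniform)
var_adaptive = estimator_variance(grads, adaptive)
```

By the Cauchy-Schwarz inequality, `var_adaptive` can never exceed `var_uniform`; the gap grows as the per-example gradient norms become more uneven.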

gradient estimate, nuclear norm, probability, (13 more...)


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)


Reviews: Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems, Jan-26-2025, 04:05:54 GMT

This paper received extensive discussion by the reviewers, the meta-reviewer, the SPC, etc. Here is a meta-review summary. The paper considers the problem of adaptively sampling training examples in stochastic optimization, and it shows that it is possible to do so without a per-iteration cost of O(N). This is of interest in itself, since one typically thinks that such sampling requires maintaining a distribution over training examples, which costs O(N) in every iteration, i.e., as much as full-batch gradient descent. A second aspect of this paper is that the mechanism by which the authors accomplish this is LSH, a sketching method usually used for nearest neighbor search.

accurate stochastic gradient estimation, training example


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)


Fast and Accurate Stochastic Gradient Estimation

Neural Information Processing Systems, Oct-10-2024, 15:51:43 GMT

Stochastic Gradient Descent (SGD) is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with a sample size of one. Several prior works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than that of calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the apparent gain of faster convergence in iterations leads, in reality, to slower convergence in time.

accurate stochastic gradient estimation, convergence, stochastic gradient descent, (2 more...)


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)


Fast and Accurate Stochastic Gradient Estimation

Chen, Beidi, Xu, Yingchen, Shrivastava, Anshumali

Neural Information Processing Systems, Mar-19-2020, 01:45:52 GMT

Stochastic Gradient Descent (SGD) is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with a sample size of one. Several prior works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than that of calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the apparent gain of faster convergence in iterations leads, in reality, to slower convergence in time.

accurate stochastic gradient estimation, convergence, stochastic gradient descent, (2 more...)


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)

